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Since their emergence in the 1990's, the support vector machine and the AdaBoost algorithm have 
spawned a wave of research in statistical machine learning. Much of this new research falls into one of two 
broad categories: kernel methods and ensemble methods. In this expository article, I discuss the main ideas 
behind these two types of methods, namely how to transform linear algorithms into nonlinear ones by using 
kernel functions, and how to make predictions with an ensemble or a collection of models rather than a 
single model. I also share my personal perspectives on how these ideas have influenced and shaped my own 
research. In particular, I present two recent algorithms that I have invented with my collaborators: LAGO, 
a fast kernel algorithm for unbalanced classification and rare target detection; and Darwinian evolution in 
parallel universes, an ensemble method for variable selection. 

Key Words: AdaBoost; kernel PCA; LAGO; parallel evolution; random forest; SVM. 

1 Introduction 

The 1990's saw two major advances in machine learning: the support vector machine (SVM) and the 
AdaBoost algorithm. Two fundamental ideas behind these algorithms are especially far-reaching. 
The first one is that we can transform many classical linear algorithms into highly flexible nonlinear 
algorithms by using kernel functions. The second one is that we can make accurate predictions by 
building an ensemble of models without much fine-tuning for each, rather than carefully fine-tuning 
a single model. 

In this expository article, I first present the main ideas behind kernel methods (Section 2) and 
ensemble methods (Section 3) by reviewing four prototypical algorithms: the support vector ma- 
chine (SVM, e.g., Cristianini and Shawe- Taylor 2000), kernel principal component analysis (kPCA, 
Scholkopf et al. 1998), AdaBoost (Freund and Schapire 1996), and random forest (Breiman 2001). 
I then illustrate the influence of these ideas on my own research (Section 4) by highlighting two 
recent algorithms that I have invented with my collaborators: LAGO (Zhu et al. 2006), a fast kernel 
machine for rare target detection; and Darwinian evolution in parallel universes (Zhu and Chipman 
2006), an ensemble method for variable selection. 

To better focus on the main ideas and not be distracted by the technicalities, I shall limit myself 
mostly to the two-class classification problem, although the SVM, AdaBoost and random forest can 
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all deal with multi-class classification and regression problems as well. Technical details that do not 
affect the understanding of the main ideas are also omitted. 

2 Kernels 

I begin with kernel methods. Even though the idea of kernels is fairly old, it is the support vector 
machine (SVM) that ignited a new wave of research in this area over the past 10 to 15 years. 

2.1 SVM 

In a two-class classification problem, we have predictor vectors Xj e K d and class labels yt £ 
{— 1, +1}, i = 1, 2, n. SVM seeks an optimal hyperplane to separate the two classes. 
A hyperplane in R d consists of all x £ M. d that satisfy the linear equation: 

.f(x)=/3 T x + /?o = 0. 

Given Xj £ M. d and j/j £ {— 1,+1}, a hyperplane is called a separating hyperplane if there exists 
c > such that 

y,(/3 T x t +/3 ) >c V i = l,2,..., n. (1) 
Clearly, a hyperplane can be reparameterized by scaling, e.g., 

/3 T x + p = is equivalent to s(/3 T x + j3 ) = 
for any scalar s. In particular, we can scale the hyperplane so that (1) becomes 

2/ l (/3 T x 4 +/3 ) > 1 V* = l,2,...,n, (2) 

that is, scaled so that c = 1. A separating hyperplane satisfying condition (2) is called a canonical 
separating hyperplane (CSHP). 

If two classes are perfectly separable, then there exist an infinite number of separating hyper- 
planes. Figure 1 shows two competing hyperplanes in such a situation. The SVM is based on the 
notion that the "best" canonical separating hyperplane to separate two classes is the one that is the 
farthest away from the training points. This notion is formalized mathematically by the margin of a 
hyperplane — hyperplanes with larger margins are better. In particular, the margin of a hyperplane 
is equal to 

margin = 2 x min{yidi, i = 1, 2, ...,n}, 

where di is the signed distance between observation x$ and the hyperplane; see Figure 1 for an 
illustration. Figure 1 also shows to a certain extent why large margins are good on an intuitive level; 
there is also an elaborate set of theories to justify this (see, e.g., Vapnik 1995). 
It can be shown (e.g., Hastie et al. 2001, Section 4.5) that di is equal to 

d i = ^(/3 T x i + /3 ). (3) 

Then, equations (2) and (3) together imply that the margin of a CSHP is equal to 

2 

margin = 2 X min{yidi} = -rr^rr- 
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Margin (Worse) 
Margin (Better) 

Figure 1: Two separating hyperplancs, one with a larger margin than the other. 

To find the "best" CSHP with the largest margin, we are interested in solving the following opti- 
mization problem: 



n 

min ||/3|| 2 + 7 ^ 



(4) 



i=i 



subject to j/i(/3 T Xi + (3 ) > 1 — £ 4 and £j > V i. 



(5) 



The extra variables & are introduced to relax the separability condition (2) because, in general, we 
can't assume the two classes are always perfectly separable. The term 7 £» acts as a penalty to 
control the degree of such relaxation, and 7 is a tuning parameter. 

The main message from the brief introduction above is this: SVM tries to find the best CSHP; 
it is therefore a linear classifier. The usual immediate response to this message is: So what? How 
does this make the SVM much different from and superior to classical logistic regression? 

Equivalently, the constrained optimization problem above can be written as (e.g., Hastie et al. 
2001, Exercise 12.1) 



mm 



n 

^[l- yj (/3 T x,+/3o)] + A||/3|| 



(6) 



where 



z + = 



z if z > 0, 
if z < 0. 



For statisticians, the objective function in (6) has the familiar form of a loss function plus a penalty 
term. For the SVM, the loss function is [1 — y(/3 T x + /3o)]-h and it is indeed very similar to the 
binomial log-likelihood used by logistic regression (e.g., Hastie et al. 2001, Figure 12.4). But the 
usual logistic regression model does not include the penalty term A||/3|| 2 . This is the familiar ridge 
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penalty and often stabilizes the solution, especially in high-dimensional problems. Indeed, this gives 
the SVM an advantage. 

However, one can't possibly expect a linear classifier to succeed in general situations, no matter 
how optimal the hyperplane is. So, why is the SVM such a sensational success? 

2.2 The "kernel trick" 

Cristianini and Shawe- Taylor (2000, Chapters 5 and 6) provided detailed derivations to show that 
the optimal (3 looks like this: 

iesv 

where "SV" denotes the set of "support vectors" with a* > strictly positive; the coefficients 
Qj, i = 1, 2, n, are solutions to the (dual) problem: 

n ^ n n 

max a i ~ 2 Yl S <Xi<XjViVi*iXj ( 7 ) 

i—l i—l j — 1 

n 

s.t. y^a,y,=0 and on > V i. (8) 

i=l 

This means the resulting hyperplane can be written as 

/(x) = /3 T x + A) - a *V^ x + A) = 0. (9) 

iesv 

The key point here is the following: In order to obtain a^, one solves (7)-(8), a problem that 
depends on the predictors Xj only through their inner-products xfx^; once the a^s are obtained, 
the ultimate decision function (9) is also just a function of inner-products in the predictor space. 

Therefore, one can make SVM a lot more general simply by defining a "different kind of inner- 
product," say, Kh{u; v), in place of u T v. The function Kh(u; v) is a called a kernel function, where h 
is a hyper-parameter, which is often determined empirically by cross-validation. Then, (7) becomes 

n _^ n n 

max - -^^aiajyiyjKh^Xj) (10) 

i—l i—l j—1 

and the decision function (9) becomes 

/(x) = a iyi K h (x;Xi) + fa = 0. (11) 

iesv 

The boundary is linear in the space of 0(x) where (/>(•) is such that 

K h (u;v) -0(u) T 0(v), 

but generally it is nonlinear in the original predictor space (unless one picks a linear kernel function). 
Mercer's theorem (Mercer 1909) guarantees the existence of such </>(•) as long as Kh is a non-negative 
definite kernel function. The beauty here is that we don't even need to define the mapping <p(-) 
explicitly; all we have to do is to pick a kernel function Kh(u; v). This makes the SVM very general. 
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2.3 Kernelization of linear algorithms 

That we can apply a linear method in a different space is, of course, not a new idea to statisticians 
at all. For example, we all know how to fit a high-order polynomial using linear regression — simply 
add the terms x 2 , x 3 , x d to the regression equation! 

The idea that we don't need to explicitly create these high-order terms is perhaps somewhat less 
familiar. Actually, it is not really a new idea, either; it is less familiar only in the sense that students 
usually don't learn about it in "Regression Analysis 101." 

However, the SVM does deserve some credit in this regard. Even though the basic idea of kernels 
is fairly old, it is the SVM that has revived it and brought it back into the spotlight for applied 
statisticians. The basic idea is as follows. 

A typical data matrix we encounter in statistics, X, is nxd, stacking n observations Xi, x 2 , x„ e 
R d as d-dimcnsional row vectors. That is, 



X 



It is easy to see that 



( *i \ 



XX J = 



x 



(xi x 2 



x„) = 



/ xfxi xfx 2 

T T 
X^ Xi x^ x 2 



T \ 
x i x n \ 



Xn X. 



2 



T T 

\ x„xi x^x 2 



/ 



is an n x n matrix of pairwise inner-products. Therefore, if a linear algorithm can be shown to 
depend on the data matrix X only through 



K = XX T , 



(12) 



then it can be easily "kernelized" — we simply replace each inner-product entry of K with Kij — 
if/j(xj,Xj), where Kh{-, •) is a desired kernel function. 



2.4 Kernel PCA 

Kernel principal component analysis (kPCA; Scholkopf et al. 1998) is a successful example of "ker- 
nclizing" a well-known classic linear algorithm. To focus on the main idea, let us assume that the 
data matrix X is already centered so that each column has mean zero. Let 



S = X J X. 



(13) 



Then, the (ordered) eigenvectors of S, say Ui, u 2 , u^, are the principal components. Being eigen- 
vectors, they satisfy the equations 



Suj = XjUj, j = l,2,...,d. 
Equations (13) and (14) together lead to 

X T Xuj = XjUj, j = 1, 2, d. 



(14) 
(15) 
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This shows that Uj can be represented in the form of X T otj — by letting otj = Xuj/Xj, to be 
specific. We will plug Uj — X T otj into (15) and reparameterize the eigenvalue problem in terms of 
a 3 . 

For j = 1, 2, d, this leads to 

X T XX T a j = \ 3 X T a r (16) 
If we left-multiply both sides by X, we get 

XX T XX T aj = X 3 XX T a j7 

or simply 

K 2 ai = XjKaj, (17) 

which shows that otj can be obtained by solving a problem that depends on the data matrix only 
through the inner-product matrix K. 

Scholkopf et al. (1998) explained why, in the context of kPCA, it is sufficient to reduce (17) to 
Kq 3 = \jOij] I do not go into this detail here. Once we obtain the exj's, suppose we'd like to project 
new data X new onto a few leading principal components, e.g., X new Uj. We immediately find that 

Xfie-y; \lj — X. new X Oij , 

and it is easily seen that X new X T is just a matrix of pairwise inner products between each new and 
old observations. 

Hence, it becomes clear that both finding and projecting onto principal components depend on 
just the inner-products and, according to Section 2.3, PCA can be "kernelized" easily. Figure 2 
shows a toy example. There are some spherical data in R 2 . The data being spherical, all directions 
have equal variance and there are no meaningful principal components in the traditional sense. But 
by using a Gaussian kernel — equation (18) below with h = 1 — in place of all the inner-products, 
the first kernel principal direction obtained gives a meaningful order of how far each observation is 
away from the origin. In this case, kernel PCA has successfully discovered the (only) underlying 
pattern in the data, one that is impossible to detect with classical PCA. 

2.5 Discussion: Kernel methods are like professional cameras 

Any acute reader must have noticed that, so far, I have never really discussed the kernel function 
_ftT/i(u;v) explicitly. This is not an accident. It is often claimed that one important advantage of 
these kernel methods lies in their modularity: to solve a different problem, just use a different kernel 
function. Any discussion about kernel functions, therefore, is best carried out in the context of a 
specific problem. 

Of course, to be effective in practice, we must use the right kernel function. What's more, we 
must choose the right hyper-parameter h as well, and the performance of the method can be quite 
sensitive to these choices in practice. These are no trivial tasks and often require a considerable 
amount of data analytic experience as well as knowledge of the specific application area. 

In this regard, these kernel-based algorithms are very much like professional cameras. They are 
capable of producing great pictures even under very difficult conditions, but you need to give them 
to a professional photographer. If you give them to an amateur or novice, you can't expect great 
pictures. The photographer must know how to select the right lens, set the right shutter speed, and 
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(a) Original Data (b) kPCA Projection 
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Figure 2: Kernel PCA, toy example, (a) Original data, (b) Projection onto the first two kernel 
principal components. 

use the right aperture for any given condition. If any of these parameters is not set appropriately, the 
result could be a disaster. But that does not mean the camera itself is a poor piece of equipment; it 
simply means one must be adequately trained to operate it. Much of the power of these professional 
cameras lies precisely in the fact that they allow a knowledgeable and experienced user to control 
exactly how each single picture should be taken. 

2.5.1 Example: Spam data 

As a very simple illustration, let us try to see how well the SVM can predict on the spam data 
set, available at http://www-stat.stcinford.edu/~tibs/ElemStatLearn/index.html. There are 
a total of n = 4,601 observations, each with a binary response and d = 57 predictors. For more 
details about this data set, refer to the aforementioned web site. I use an R package called el071 to 
fit SVMs and use the kernel function 

K h (u;v) =exp{-h||u- v|| 2 }. (18) 

A random sample of 1,536 observations are used as training data and the remaining 3,065 
observations are used as test data. Using different values of 7 and h, a series of SVMs are fitted on 
the training data and then applied to the test data. The total number of misclassification errors on 
the test data are recorded and plotted for each pair of (7, h); see Figure 3(a). Here, 7 is the penalty 
parameter in equation (4). 

Figure 3(a) shows that the performance of SVM using this particular kernel function is very 
sensitive to the parameter h but not as sensitive to the parameter 7. Given h, the prediction 
performance of SVM is often quite stable for a wide range of 7's, but bad choices of h can lead 
to serious deteriorations in the prediction performance. Therefore, if one uses the SVM without 
carefully tuning the parameter h, the result can be disastrous. 
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(a) Mistakes on Test Set: SVM 



(b) Mistakes on Test Set: Random Forest 




Figure 3: Spam data example, (a) SVM: Number of misclassification errors on test data as a 
function of two tuning parameters, 7 and h (see Section 2.5.1). (b) Random forest: Number of 
misclassification errors on test data as a function of two tuning parameters, m and B (see Section 
3.5.1). 

3 Ensembles 

I now turn to ensemble methods. Again, I shall mainly focus on the two-class classification problem 
with predictor vectors Xj G R d and class labels yt G {— 1,+1}, i = 1,2, ...,n. 

3.1 AdaBoost 

AdaBoost constructs a collection of classifiers rather than one single classifier. The entire collection 
makes up an ensemble, and it is the ensemble — not any single classifier alone — that makes the 
final classification. 

Table 1 contains an exact description of the AdaBoost algorithm. Here is a description of the 
algorithm in plain English: Start by assigning equal weights to all observations in the training data. 
Sequentially build a series of classifiers. At each step, fit a classifier, say to the training data 
using the current weights. Calculate the (properly weighted) right-to-wrong ratio of this classifier; 
call it Rb- For those observations incorrectly classified by /&, inflate their weights by a factor of Rt,. 
With the new weights, build the next classifier. In the end, each classifier /j, in the ensemble will 
cast a vote; its vote is to be weighted by the logarithm of its right-to-wrong ratio, log(-Rb). 

For people hearing about this algorithm for the very first time, AdaBoost certainly has a very 
strong mystical flavor to it. Intuitively, we can perhaps appreciate to some extent that the right-to- 
wrong ratio must be important for any classifier, but it is not at all clear why we should reweight 
incorrectly classified observations by this ratio each time, nor is it immediately clear why the final 
vote from each individual member of the ensemble should be weighted by the logarithm of this ratio. 

This is no easy mystery to untangle. Friedman et al. (2000) gave a very nice argument and 
revealed that the AdaBoost algorithm actually minimizes an exponential loss function using a for- 
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Tabic 1: The AdaBoost Algorithm. 



1 



Initial weights: Wi = 1/n, V i. 



2 



For b = 1 to B: 



(a) Using weights Wi,i = 1,2, ...,n, fit a classifier /b(x) 6 {— 1, +1}. 

(b) Set 




(c) Update weights: Wi <— Wi x if y, / /b(x»). 



End For. 
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Output an ensemble classifier 



F(x) = sign ( ^ Ofc/fc(x) 



ward stagewise approach. In particular, AdaBoost chooses the best a& and /b one step at a time to 
minimize 



which they showed to be very similar to maximizing the binomial log-likelihood. This particular 
interpretation has not only untangled the AdaBoost mystery (at least to some extent), but also led 
to many new (and sometimes better) versions of boosting algorithms. 

3.2 Random forest 

Professor Leo Breiman came up with the same basic idea of using a collection or an ensemble 
of models to make predictions, except he constructed his ensemble in a slightly different manner. 
Breiman called his ensembles random forests; details are given in Table 2. 

The history behind Breiman's random forest is very interesting. In 1996, he first proposed an 
ensemble algorithm called Bagging (Breiman 1996), which is essentially the random forest algorithm 
with just the bootstrap step (Table 2, step la). In 2001, he added the random subset step (Table 2, 
step lb) and created random forest (Breiman 2001). 

Why did he add the extra random subset step? 

3.3 Breiman's theorem 

Breiman (2001) proved a remarkable theoretical result. First, he gave a formal definition of random 
forests: The set 



is called a random forest. 

This definition requires some explanation. Here, /(x; 9b) is a classifier completely parameterized 
by 6b- For example, if f(-;0b) is a classification tree, then the parameter 9b specifies all the splits 




{/(x; 6 ): 6 'A??*, 6=1,2, ... 
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Table 2: Brciman's Random Forest Algorithm. 



1. For each b = 1 to B, fit a maximal-depth tree, /i>(x), as follows: 

(a) (Bootstrap Step) Draw a bootstrap sample of the training data; call it D* b . 
Use D* b to fit f b . 

(b) (Random Subset Step) When building /&, randomly select a subset of m < d 
predictors before making each split — call it S, and make the best split over 
the set S rather than over all possible predictors. 

End For. 

2. Output an ensemble classifier, i.e., to classify x netu , simply take majority vote over 
{f b (x„ ew ),b=l,2,...,B}. 



and the estimates in the terminal nodes. Next, the statement "9b ~ Ve" means that each /(•; 9b) is 
generated independently and identically from some underlying random mechanism, V$. 

To be specific, in Breiman's implementation, iid sampling from the random mechanism Ve con- 
sists of: (i) iid sampling from the empirical distribution F n (the bootstrap step), and (ii) iid sampling 
from the set {1, 2, d} (the random subset step). 

Brciman then proved that the prediction error of a random forest, chf, satisfies the inequality 

*RF<p(^J-), (19) 



where p is the mean correlation between any two members of the forest (ensemble) and s, the mean 
strength of a typical member of the forest (ensemble) . This result — including the exact definitions 
of p and s — is fairly technical; details can be found in Brciman (2001). Moreover, the actual bound 
itself is often useless. For example, if s = 0.4 and p = 0.5, then one gets 

/l-5 2 \ /l-0.4 2 \ 
e RF < P (— ) = 0.5 (-^) = 2.625, 

but of course the error rate is less than 100%. 
So, why is this result significant? 



3.4 The secret of ensembles 

The fundamental idea of using an ensemble classifier rather than a single classifier is nothing short of 
being revolutionary. It also is remarkable that building these ensembles is often relatively mindless. 
Take Brciman's random forest, for example. There is no need to prune the individual trees. 

Clearly, there are many different ways to build an ensemble, AdaBoost and Breiman's random 
forest being two primary examples. What's the most effective way? 

Recall the formal definition of random forests. The random mechanism Vg that generates the 
individual members of the forest is unspecified. You are free to pick any mechanism you want. 
Surely some mechanisms are bound to be more effective than others. What's the most effective 
mechanism? 
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Breiman's result is significant because it tells us what makes a good random forest. Brciman's 
theorem (19) tells us that a good random forest should have a small p and a large s. That is, we 
should try to reduce the correlation between individual classifiers within the ensemble and make 
each individual classifier as accurate as possible. 

This explains why Breiman added the random subset step into his original Bagging algorithm: 
extra randomness is needed to reduce the correlation between individual trees; the bootstrap step 
alone is not enough! 

Interestingly, we can see that AdaBoost actually operates in a similar way. Going back to step 
(2b) in Table 1, we have 

Ei=l W i 

From this, we can write 

E w « = E ^ 

all wrong 



<b^Wi= ^2 w i and (1 - €b) ^ Wi = 2 , u : 



where "all" means i = 1,2, ...,n; "wrong" denotes the set {i : yt ^ /&(xi)} and "right," the set 
{i : yi — f (xi)}. Step (2c) in Table 1 gives the explicit update rule; the new weights are: 

7 fc £b ) ' for i e wrong; 

for i G right. 

Therefore, we can see that 




E < ew = (h^)^ Wi = (! - e ")E^ = E^ = E< e "' 

' wrong all right right 



which means the misclassification error of f under the new weights w" et0 is exactly 50% — the 
worst possible error. 

The next classifier, fb+i, is built using these new weights, so it is set up to work with a (weighted) 
dataset that the current classifier, /(,, cannot classify. This is sometimes referred to as "decoupling" 
in the boosting literature — the classifier fb+i is decoupled from /(,. 

In Brciman's language, we can say that the adaptive and hitherto mysterious reweighting mech- 
anism in AdaBoost is actually aiming to reduce the correlation between consecutive members of the 
ensemble. 



3.5 Discussion: Ensemble methods are like foolproof cameras 

Compared with kernel methods, ensemble methods are very much like foolproof cameras. They 
are relatively easy for the less experienced users to operate. This does not mean they don't have 
any tuning parameters; they do. Even when using a foolproof camera, one must still make a few 
decisions, e.g., whether or not to turn on the flash, and so on. But relatively speaking, the number 
of decisions one has to make is much more limited and these decisions are also relatively easy to 
make. 

For example, in Brciman's random forest, the size of the subset, m (Table 2, step lb), is an 
important tuning parameter. If m is too large, it will cause p to be too large. In the extreme case 
of m = d, all the trees in the forest will be searching over the entire set of variables in order to make 
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splits, and they will be identical — since the tree-growing algorithm is deterministic conditional on 
the data. On the other hand, if m is too small, it will cause s to be too small. In the extreme case 
of m = 1, all the trees will essentially be making random splits, and they will not be very good 
classifiers. There is plenty of empirical evidence to suggest, however, that the parameter m is still 
relatively easy to choose in practice. Moreover, the parameter m is not as sensitive as the complexity 
parameter h of a kernel function (also see Section 3.5.1 below). Translation: Even if you are a bit 
off, the consequences will not be quite so disastrous. 

I have had many occasions working with graduate students trying to make predictions using the 
SVM and Breiman's random forest. They almost always produce much better predictions with the 
random forest, even on problems that are well-suited for the SVM! Sometimes, their SVMs actually 
perform worse than linear logistic regression. Certainly, there are many cases in practice where one 
would not expect the SVM to be much superior to linear logistic regression, e.g., when the true 
decision boundary is in fact linear. But if used correctly, the SVM should at least be comparable 
with linear logistic regression; there is no reason why it ever would be much worse. These experiences 
remind me over and over again just how difficult it can be for a novice to use the SVM. 

But, as I stated in Section 2.5, you can't blame the professional camera if you don't know how 
to use it properly. There is always a tradeoff. With limited flexibility, even a fully-experienced pro- 
fessional photographer won't be able to produce images of the highest professional quality with just 
a foolproof camera, especially under nonstandard and difficult conditions. That's why professional 
cameras are still on the market. But we have to admit: most consumers are amateur photographers 
and, more often than not, they are taking pictures under fairly standard conditions. That's why 
the demand for foolproof cameras far exceeds that for professional cameras. I think the demand for 
statistical tools follows a similar pattern. 

3.5.1 Example: Spam data (continued) 

As a simple illustration, let us take a look at how well the random forest can predict on the spam 
data set. I use exactly the same set-up as in Section 2.5.1 and the randomForest package in R. 
Using different values of m and B, a series of random forests are fitted on the training data and 
then applied to the test data. The total number of misclassification errors on the test data are 
recorded and plotted; see Figure 3(b). Here, we can see that the performance of random forests is 
more sensitive to the parameter m than to the parameter B. Given m, the prediction performance 
of random forests is fairly stable as long as B is sufficiently large, e.g., B > 100 in this case. But it 
is important to use an m that is neither too small nor too big, e.g., 3 < m < 10 in this case. 

However, if we compare panels (a) and (b) in Figure 3, we can see that choosing the right h for 
SVM is much more critical than choosing the right m for random forest; performance deterioration 
is much more serious for bad choices of h than for bad choices of m. 

It is also clear from Figure 3 that, for this particular data set, an SVM with kernel function 
(18) is not competitive against a random forest, even if well tuned. In order to be competitive, it is 
necessary to use a different kernel function. I do not pursue this possibility here because getting the 
SVM to work for this data set is far from the main point of our discussion, but this example does 
demonstrate that choosing the right kernel function and picking the right hypcrparameter h are 
very important, and that an ensemble method such as the random forest can be somewhat easier to 
use in this regard. 
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4 Perspectives 



I now share a few personal perspectives on statistical learning research. Here, I am working with a 
particular definition of the word "perspective" from the American Heritage Dictionary: a subjective 
evaluation of relative significance [emphasis added]. 

4.1 Statistical learning research 

My discussions in Sections 2.5 and 3.5 have led me to ask the following question: If I were the presi- 
dent of a big camera manufacturing company, how would I run such a business? Other than standard 
business divisions such as accounting and human resources, I see three main lines of operation: 

1. (Consulting and Consumer Outreach) Advise and teach photographers how to use various 
products and how to use the right equipment to produce great pictures under various difficult 
conditions. This is my consulting and consumer outreach division. 

2. (High-end R&D) Understand the need of professional photographers and manufacture new, 
specialized equipment still lacking on the market. This is my R&D division for my high-end 



3. (Mass R&D) Build the next-generation foolproof camera. This is my R&D division for my 
mass consumers. 

I see a great deal of parallelism in statistical learning research. For statistical learning research, 
the consulting and consumer outreach division applies different learning methods to solve various 
difficult real-world problems; the high-end R&D division develops new, specialized algorithms for 
analyzing new types of data or data with special characteristics; and the mass R&D division develops 
better off-the-shelf learning algorithms. 

With this particular point of view in mind, I end this article by briefly describing two personal 
learning products: a new kernel method from my high-end R&D division, and a new ensemble 
method from my mass R&D division. 



Consider a two-class problem in which the class of interest (Ci) is very rare; most observations 
belong to a majority, background class (Co). Given a set of unlabelled observations, the goal is to 
rank those belonging to C\ ahead of the rest. 

Of course, one can use any classifier to do this as long as the classifier is capable of producing not 
only a class label but also an estimated posterior probability P(y € Ci|x) or a classification score. 
For example, the SVM does not estimate posterior probabilities, but the final decision function (11) 
is a classification score which can be used (at least operationally) to rank unlabelled observations 
— whether this is effective or not is a separate issue. 

4.2.1 RBFnets 

The final decision function produced by SVM (11) is of the form 



consumers. 



4.2 A high-end R&D product: LAGO 




(20) 
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where 0(x; /x, R) is a kernel function. For example, we can take R to be diagonal and let <j> be the 
Gaussian kernel 



0(x;/i, R) = — ; exp 

V(2^F|R| 



(x-/x) t R- 2 (x-/lx) 



(21) 



where |R| is the determinant of R. 

The function (20) is sometimes called a (single- layer) radial basis function network (RBFnet). 
Generally speaking, to construct an RBFnet one must compute and specify three ingredients: 

/x i7 the location parameter of each kernel function — together, they make up the set S; 

Ri, the shape parameter of each kernel function; and 

Pi, the coefficient in front of each kernel function. 

Typically, one first specifies [i i and Ri and then estimates the /Vs by least-squares or maximum 
likelihood. Often, one sets R^ = ri and treats the parameter r as a global tuning parameter - 
this is what SVM does. Determining the /x^'s or the best set 5* from training data, however, is an 
NP-hard combinatorial optimization problem in general. 

The SVM can be viewed as an algorithm for determining the set S and the /Vs simultaneously 
(Scholkopf et al. 1997); the set S = SV is simply the set of all support vectors. In order to do so, 
SVM solves a quadratic programming instead of a combinatorial optimization problem. 



4.2.2 LAGO 

The product from my R&D division is an algorithm called LAGO (Zhu et al. 2006). The decision 
function constructed by LAGO for ranking unlabelled observations is as follows: 

/(x)= IRiltftoxi.aRi). (22) 
The parameter a is a global tuning parameter. In the simplest case, we take 

r 4 = m, (23) 

where n is the average distance between the kernel center, x, <E Ci, and its if-nearest neighbors 
from Co, i.e., 

n = ^ E d (^^)- (24) 

The notation "N (xi, K)" denotes the if-nearest neighbors of x, from C ; and d(u, v) is a distance 
function, e.g., d(u, v) = ||u — v||. 

By comparing (22)- (23) with (20), we can easily see that LAGO can also be viewed as an 
algorithm for constructing an RBFnet, just like the SVM. In particular, the three ingredients of the 
RBFnet are specified as follows: 

fj,^. Every fi i is a training observation Xj from the rare class, C\. 
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R^: Each kernel function <f> is spherical with radius proportional to the average distance between 
its center £ C\ and its the if- nearest neighbors from Cq. 

fa: Simply set fa = and fa = |R 4 | V i > 0. 

Here we see that the only computation needed is the calculation of r, ; all other ingredients are com- 
pletely determined a priori. The calculation of r,, of course, is considerably simpler than quadratic 
programming, making LAGO many times faster and simpler than the SVM. Instead of solving an 
optimization problem to find support vectors, LAGO fully exploits the special nature of these rare- 
class detection problems and simply uses all training observations from the rare class as its "support 
vectors," a significant shortcut. Our empirical experiences show that the shortcut is highly worth- 
while. We find that LAGO almost always performs as well as and sometimes even better than the 
SVM for these rare-class classification and detection problems. 

Zhu et al. (2006) give a few theoretical arguments for why all these shortcuts are justified. 
Suppose pi(x) andpo( x ) are density functions of C\ and Cq. The main argument is that (22) can be 
viewed as a kernel density estimate of pi adjusted locally by a factor that is approximately inversely 
proportional to pg 7 i.e., |R»|. The resulting ranking function /(x) is thus approximately a monotonic 
transformation of the posterior probability that item x belongs to the rare class. 

The only nontrivial calculation performed by the algorithm, equation (24), is somewhat special 
and nonstandard. The original idea came from a Chinese board game called GO. Consider the 
two black stones labelled A and B in Figure 4. A GO player will tell you that B controls more 
territories on the board than A. Why? Because, when compared with B, A is closer to more enemy 
(white) stones. Therefore, imagine two classes fighting for control over a common space. Given an 
observation from C±, if we want to use a kernel function to describe its effective control over the 
entire space, we should use a large kernel radius if its nearby neighbors from Co are a long distance 
away and a small kernel radius if its nearby neighbors from Co are a short distance away. Equation 
(24) captures this basic principle. 
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Figure 4: The board game of GO. In this illustration, the black stone B controls more territory than 
the black stone A. 
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4.2.3 eLAGO versus sLAGO 



Instead of (23)-(24), the original LAGO paper (Zhu et al. 2006) used 

Ri =diag{rii,r i2 ,...,r i(j }, r l3 = ^ \xij-Wj\. (25) 

weA r (x i ,_ff ) 

That is, the kernel function <j> was chosen to be elliptical rather than spherical. To distinguish the 
two, we call (25) eLAGO and (23), sLAGO. For many real-world rare-class problems, the dataset 
often contains a limited amount of information because G\ is very rare. As such, the extra flexibility 
afforded by eLAGO is seldom needed in practice. 

4.2.4 Discussion: LAGO is a specialized kernel method 

LAGO is a kernel method, much like the SVM. There are two tuning parameters, K and a. Exper- 
iments similar to those described in Section 2.5.1 and Figure 3 have shown that the performance of 
LAGO is not very sensitive to K and much more sensitive to a. In practice, it often suffices to fix 
K = 5. 

LAGO is not a general-purpose method; it is a specialized algorithm for a special learning 
problem, namely rare-class classification and detection. Its main advantages are its speed and 
simplicity. Discussions in Section 2.5 have made it clear that these kernel methods must be carefully 
tuned, e.g., using empirical procedures such as cross-validation. This means that, in practice, one 
almost always has to run these algorithms repeatedly for many times. One may be tempted to think 
that, if one algorithm takes 10 minutes to run and another takes 1 minute, the difference is still 
"negligible" for all practical purposes, but such ten-fold differences are often significantly magnified 
if one has to run these two algorithms repeatedly for many times. 

Apart from these practical matters such as time savings, the more important lesson from this 
research lies in the basic principles behind the construction of LAGO (22). Here, we sec that it 
always pays to exploit the special nature of an underlying problem. For these rare-class problems, 
there is only limited amount of useful information in the training data. LAGO fully exploits this 
fact by immediately zooming into the useful information (i.e., Xj £ C\) and making a few quick local 
adjustments based on Vi — equation (24). 

4.3 A mass R&D product: Darwinian evolution in parallel universes 

Let us now consider a different problem, the variable selection problem. Given d potential predictors, 
which combination is the best for predicting y? Let fi be the space of all possible subsets of 
C = {xi,X2, ...,Xd}. The typical approach is as follows: First, define a proper evaluation criterion, 

F(u) : fl — > R. 

Preferably F should be a fair measure of uj £ fl. Common examples of F include the Akaike 
information criterion (AIC, Akaike 1973), the Bayesian information criterion (BIC, Schwarz 1978), 
and generalized cross-validation (GCV), to name a few. Then, use a search algorithm to find the 
best u> which optimizes F{uo). 

4.3.1 Two challenges: computation and criterion 

There are two main challenges. The first one is computation. With d potential predictors, the size of 
Q is |f2| = 2 d . This gets large very quickly. For example, take d = 100 and suppose we can evaluate 
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a billion (10 9 ) subsets per second. How long will it take us to evaluate all of them? The answer is 
about 40,000 billion years: 

2 100 ^_ 1Q 9 ^ 36 QQ ^ 24 ^ 3 g 5 _ qqq x 10 9 

This may seem serious, but it actually is not the problem we shall be concerned about here. Everyone 
must face this problem; there is no way out — just yet. For moderately large d, exhaustive search 
is impossible; stepwise or heuristic search algorithms must be used. 

The second challenge is more substantial, especially for statisticians, and that's the question of 
what makes a good evaluation criterion, F. It is well-known that both the AIC and the BIC are 
problematic in practice. Roughly speaking, with finite data, the AIC tends to favor subsets that are 
too large, while the BIC tends to favor ones that are too small. For classic linear models, both the 
AIC and the BIC have the form: 

F(oj) = goodncss-of-fit(cj) + j\u>\, 

where \ui\ is the size of w, or the number of variables included. The AIC uses 7 = 2 whereas the 
BIC uses 7 = log(n), n being the sample size. Therefore, it appears that 7 = 2 is too small and 
7 = log(n) is too big. But if this is the case, surely there must be a magic 7 somewhere in between? 
So why not find out what it is? While this logic is certainly quite natural, it by no means implies 
that the task is easy. 

4.3.2 Darwinian evolution in parallel universes 

The product from my R&D division is a very simple yet surprisingly effective method for variable 
selection by using Darwinian evolution in parallel universes (Zhu and Chipman 2006). 

Here is how the algorithm works in a nutshell. Create a number of parallel universes. In 
each universe, run an evolutionary algorithm using the (apparently incorrect) AIC as the objective 
function for just a few generations — the evolutionary algorithm is a heuristic stochastic search 
algorithm that mimics Darwin's "natural selection" to optimize any given objective function (see, 
e.g., Goldberg 1989). Whatever it is, there will be a current best solution in each universe when we 
stop. For example, the current best subset in universe 1 may be {23, x&, xio}; in universe 2, it may 
be {xi, X3, x$, £15}; in universe 3, perhaps {X3, X5, x 8 , Xn}; and so on. These form an ensemble. 
Now take a majority vote and select those variables that show up in significantly more universes 
than the rest. In the example here, this would be {x3,x§} — and that's the answer. 

4.3.3 Explanation with a toy example 

Why does this simple strategy work? A small toy example is enough to illustrate the gist of the 
idea. Generate 

Vi = %i,2 + Xi, 5 + x i>6 + ej, x i; i, ...,Xi 4 o,£i *~ N(0, 1), i = 1,2, ...,50. 

In other words, there are 10 potential predictors but the true model contains only 3 of them: X2, X5, 
and x$- With just 10 variables, there are altogether 2 10 = 1,024 subsets, and we can still afford to 
exhaustively compute the AIC for each one of them. Figure 5 plots the AIC versus the size for all 
2 10 possible subsets. A number of characteristic observations can be made: 

1. The subset that has the smallest AIC is wrong; it includes a few variables too many. 
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2. On the AIC scale, many subsets are very close to each other and it is hard to tell them apart. 

3. Let's separate the 2 10 subsets into two groups. Group I consists of those that include all the 
true variables — they are labelled with circles (o) in the plot. Group II consists of those that 
miss out on at least one of the true variables — they are labelled with crosses (x), pluses (+), 
and triangles (A). Then, on the AIC scale, a significant gap exists between these two groups. 

Having made these observations, we are now ready to explain why parallel evolution works. The 
large gap between group I and group II (observation 3) means that members from group I arc 
significantly superior and hence easily favored by evolution. Therefore, after evolving for just a few 
generations, the current best subset in each universe is likely a member from group I. They are 
the ones that make up our ensemble. What do they have in common? They all include the 3 true 
variables. 



AIC of All 2 10 Models 




i i i i i i i i i i r 

01 23456789 10 

Model Size 



Figure 5: Why does parallel evolution work? For what this figure tells us, see Section 4.3.3. 

But in order for majority vote to be effective in selecting the right variables, it is necessary that 
the true variables are the only thing that these ensemble members have in common. That's why 
we can't run the evolution for too long in each universe. With a short evolution, since members 
of group I are hard to distinguish from each other on the AIC scale (observation 2), the random 
nature of evolution will cause each universe to settle on different members from this group. If, on 
the other hand, we run the evolution for too long, the current best subsets from different universes 
will start to develop something else in common — they will all start to converge to the minimum 
AIC solution, which includes spurious variables (observation 1). 
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Figure 6 illustrates how parallel evolution works on this toy example. After running the evolu- 
tionary algorithm for just 6 generations in each universe, we measure the importance of a variable 
by how often it shows up across the parallel universes. The correct solution for this example is 
{2, 5, 8}. When a single universe is used (B = 1), we get the wrong solution — a spurious variable, 
namely variable 6, also shows up. But as more and more parallel universes are used, only the truly 
important variables, i.e., variables 2, 5 and 8 in this case, can "survive" the majority vote. We can 
see from Figure 6 that when as few as B = 10 universes are used, the correct solution is already 
easily discernible: out of the 10 universes, variables 2, 5, and 8 each showed up at least 9 times; 
variable 6 showed up 4 times; and all other variables showed up at most twice. 
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Figure 6: Parallel evolution on the toy example (Section 4.3.3). The correct solution for this example 
is {2, 5, 8}. When B = 10 parallel universes are used, the correct solution is already easily discernible. 



4.3.4 Discussion: Parallel evolution is an easy-to-use ensemble method 

Parallel evolution for variable selection is a successful example of using ensembles in a very different 
context. By using an ensemble, we can significantly "boost up" the performance of an apparently 
wrong variable selection criterion such as the AIC. The procedure is very easy to use. Most im- 
portantly, it is trivial to adapt this principle to general variable selection problems regardless of 
whether the underlying model is a classic linear model, a generalized linear model, a generalized 
additive model, a Cox proportional hazard model, or any other model for which the question of 
variable selection is meaningful. As such, it is not unfair to call parallel evolution a first-generation, 
foolproof, off-the-shelf variable selector. 

A number of smart statisticians have questioned whether it is necessary to use the evolutionary 
algorithm. For example, one can apply Breiman's Bagging principle and create an ensemble as 
follows: Draw a bootstrap sample of the data. Using the bootstrap sample, run a stepwise algorithm 
to optimize the AIC and choose a subset. Do this many times, and we get an ensemble of subsets. 
Take majority vote. Clearly, this would also work. I have experimented with this idea and found that 
it is not as effective; the probability of selecting the right subset of variables decreases significantly 
in simulation experiments. Why? Breiman's theorem (Section 3.3) points us to an answer. Because 
bootstrapping alone does not create enough diversity within the ensemble. These subsets share too 
many things in common with the minimal AIC solution. 
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4.4 Section Summary 

In this section, I have discussed a new kernel-based algorithm for rare target detection, LAGO, and a 
new ensemble method for variable selection based on parallel evolution. In doing so, a more general 
formulation of LAGO is presented (Section 4.2) using much better mathematical notation, e.g., 
equation (22). A simpler version, sLAGO, is given for the first time. Better explanations are also 
given for why parallel evolution (Section 4.3) works, e.g., Figure 5. Many people have the incorrect 
understanding that parallel evolution is merely a better search algorithm for variable selection. This 
is simply not true. In Section 4.3, it is emphasized that, instead of a better search algorithm, parallel 
evolution is actually an ensemble method that boosts up the performance of an apparently incorrect 
search criterion such as the AIC. 

5 Conclusion 

So, what have we learned? First of all, we learned that, by using kernel functions, we can use many 
linear algorithms such as separating hypcrplanes and principal component analysis to find nonlinear 
patterns (Section 2). This easily can be done as long as the underlying linear algorithm can be 
shown to depend on the data only through pairwise inner-products, i.e., x^Xj. Then, we simply 
can replace the inner-product xf Xj with a kernel function Kh(x-i;Xj). However, even though such a 
framework is straightforward, we also learned that it is important in practice to use the right kernel 
function Kh and to carefully select the hyperparameter h (Section 2.5). We saw that this is not 
necessarily an easy task (Section 2.5.1). 

We then learned about ensemble methods (Section 3). The fundamental idea there is to use a 
collection of perhaps not-so- well-tuned models rather than a single model that often requires careful 
fine-tuning. This usually makes ensemble methods easier to use for non-experts. I then emphasized 
that, even for ensembles, it is necessary to perform some fine-tuning (Section 3.5) — this typically 
involves creating the right amount of diversity in the ensemble (Section 3.4 and 3.5). However, we 
saw that fine-tuning an ensemble algorithm is often easier than fine-tuning a kernel-based algorithm 
(Section 3.5.1). 

I then argued (Section 4.1) that kernel methods and ensemble methods need to co-exist in 
practice. In particular, non-experts may tend to prefer ensemble methods because they are easier 
to use, whereas experts may tend to prefer kernel methods because they provide more flexibility 
for solving nonstandard and difficult problems (Sections 2.5 and 3.5). Hence, it is important for 
researchers in statistical machine learning to advance both types of methodology. I then presented 
some of my own research on both fronts: LAGO, a fast kernel machine for rare target detection; 
and Darwinian evolution in parallel universes, an ensemble method for variable selection. 
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